mbpp+

p-values

The null hypothesis is that models A and B each have a 1/2 chance of winning on any problem where their results differ; ties are excluded. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Hover over an entry to display the information used to compute its p-value.
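As a rough illustration of the test described above, the following sketch computes an exact sign-test p-value from the two win counts. It assumes a two-sided test, which the text does not state explicitly, and the function name `sign_test_p_value` is hypothetical rather than taken from this page's code.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided p-value under the null P(A wins) = 1/2.

    Only problems where the two models differ are counted; ties are excluded.
    The two-sided form is an assumption, not confirmed by the page.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no informative problems, so no evidence either way
    k = max(wins_a, wins_b)
    # Upper tail P(X >= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1
    return min(1.0, 2 * tail)


print(sign_test_p_value(8, 2))
```

With 8 wins against 2 the tail is 56/1024, so the two-sided p-value is about 0.109: a 4-to-1 split over only ten differing problems is not yet strong evidence.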

Typical delta to give good p-values

We can also look at the p-value that a given difference in accuracy typically yields. Hover over a point to display the model pair it represents.

Pairwise wins (including ties)

Following Chatbot Arena, this shows the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
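A minimal sketch of how one such head-to-head tally could be produced from per-problem pass/fail results. It assumes the two types of ties are "both models pass" and "both models fail", which the text does not spell out; the function name `head_to_head` and the dictionary keys are hypothetical.

```python
def head_to_head(passed_a, passed_b):
    """Tally per-problem outcomes for two models.

    passed_a and passed_b are equal-length sequences of booleans, one entry
    per benchmark problem. Splitting ties into both-pass and both-fail is an
    assumption about what the two tie types mean.
    """
    if len(passed_a) != len(passed_b):
        raise ValueError("both models must be evaluated on the same problems")
    counts = {"a_wins": 0, "b_wins": 0, "both_pass": 0, "both_fail": 0}
    for a, b in zip(passed_a, passed_b):
        if a and b:
            counts["both_pass"] += 1
        elif a:
            counts["a_wins"] += 1
        elif b:
            counts["b_wins"] += 1
        else:
            counts["both_fail"] += 1
    return counts


print(head_to_head([True, True, False, False], [True, False, True, False]))
```

Only the `a_wins` and `b_wins` counts carry information about which model is better; the two tie counts show how many problems fail to separate the pair.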

Result table

We show three methods currently used for evaluating code models: raw accuracy, as used by benchmarks; average win rate over all other models; and Elo (technically Bradley-Terry coefficients, following Chatbot Arena). These usually have near-perfect correlation.
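To make the last two metrics concrete, here is a hedged sketch that computes an average win rate and Bradley-Terry strengths from a matrix of pairwise win counts. It is not the code behind this table. The function names are hypothetical and ties are simply ignored. The fit uses the classic MM (Zermelo) iteration rather than whatever solver Chatbot Arena uses, and the 400-point, base-10 scale centered on 1000 is the conventional Elo choice, assumed here. The fit also assumes every model has at least one win and one loss.

```python
import math


def average_win_rate(wins):
    """Mean over opponents of wins[i][j] / (wins[i][j] + wins[j][i])."""
    n = len(wins)
    rates = []
    for i in range(n):
        per_opponent = []
        for j in range(n):
            games = wins[i][j] + wins[j][i]
            if j != i and games > 0:
                per_opponent.append(wins[i][j] / games)
        rates.append(sum(per_opponent) / len(per_opponent) if per_opponent else 0.5)
    return rates


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i][j] is how often model i beat model j. Assumes every model has at
    least one win and one loss, otherwise the estimate is degenerate.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]  # strengths are only defined up to scale
    return p


def to_elo_scale(strengths, center=1000.0):
    """Map strengths to an Elo-like scale; the offset is arbitrary (assumed 1000)."""
    raw = [400.0 * math.log10(x) for x in strengths]
    shift = center - sum(raw) / len(raw)
    return [r + shift for r in raw]


wins = [[0, 3], [1, 0]]  # toy data: model 0 beat model 1 three times, lost once
print(average_win_rate(wins), to_elo_scale(bradley_terry(wins)))
```

In this two-model toy case the fitted strengths have ratio 3:1, matching the 75% win rate, so the rating gap is 400·log10(3), roughly 191 points. The agreement of the two rankings here is in line with the near-perfect correlation noted above.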